Abstract

Often, high dimensional data lie close to a 
low-dimensional submanifold and it is of in- 
terest to understand the geometry of these 
submanifolds. The homology groups of a 
manifold are important topological invariants 
that provide an algebraic summary of the 
manifold. These groups contain rich topo- 
logical information, for instance, about the 
connected components, holes, tunnels and 
sometimes the dimension of the manifold. In 
this paper, we consider the statistical prob- 
lem of estimating the homology of a mani- 
fold from noisy samples under several differ- 
ent noise models. We derive upper and lower 
bounds on the minimax risk for this problem. 
Our upper bounds are based on estimators 
which are constructed from a union of balls of 
appropriate radius around carefully selected 
points. In each case we establish complemen- 
tary lower bounds using Le Cam's lemma. 

1 Introduction 

Let $M$ be a $d$-dimensional manifold embedded in $\mathbb{R}^D$
where $d < D$. The homology groups $\mathcal{H}(M)$ of $M$ [TT]
are an algebraic summary of the properties of $M$. The
homology groups of a manifold describe its topologi- 
cal features such as its connected components, holes, 
tunnels, etc. 

In machine learning, there is much focus on cluster- 
ing. However, the clusters are only the zeroth order 
homology and hence only scratch the surface of the 
topological information in a dataset. Extracting in- 
formation beyond clustering is known as topological 
data analysis. It is worth emphasizing that the homology
groups are topological invariants of a manifold
that can be efficiently computed [H [5]. Examples of
applications of homology inference have been growing
rapidly in the last few years. Homology inference has
found application in medical imaging and neuroscience
[21 [H], sensor networks [SJ HO], landmark-based shape
data analyses [TU], proteomics [12], microarray analysis
[2] and cellular biology [M]. The books by [H [HI [23]
contain various case studies in applications in fields 
ranging from computational biology to geophysics. 

In this paper we study the problem of estimating 
the homology of a manifold M from a noisy sample 
$Y_1, \ldots, Y_n$. Specifically, we bound the minimax risk

$$R_n = \inf_{\hat{\mathcal{H}}} \sup_{Q \in \mathcal{Q}} Q^n\big(\hat{\mathcal{H}} \neq \mathcal{H}(M)\big) \qquad (1)$$

where the infimum is over all estimators $\hat{\mathcal{H}}$ of the homology
of $M$ and the supremum is over appropriately
defined classes of distributions $\mathcal{Q}$ for $Y$. Note that
$0 \le R_n \le 1$ with $R_n = 1$ meaning that the problem
is hopeless. Bounding the minimax risk is equivalent
to bounding the sample complexity of the best possible
estimator, defined by $n(\epsilon) = \min\{n : R_n \le \epsilon\}$ where
$0 < \epsilon < 1$.

1.1 Related Work 

Other work on statistical homology includes that of
Chazal et al. [2], who show that under certain conditions
the homology estimate of a manifold from a sample
is stable under noise perturbation that is small in a
Wasserstein sense. Kahle [13] studies the homology of
random geometric graphs and proves many threshold
and central limit theorems for their homology. Adler
et al. [T] study the homology induced by the level
sets of certain Gaussian random fields. There is also a
large literature on manifold denoising that focuses on
aspects of the manifold not related to homology; see
for instance [12] and references therein.

Our upper bounds mainly generalize those in the work 
of Niyogi, Smale and Weinberger (henceforth NSW) 
[16] [TT]. They establish a general result showing that
when all the samples are dense in a thin region
surrounding the manifold, a union of appropriately sized
balls around the samples can be used to construct an 
accurate estimate of the homology with high probabil- 
ity. Under a variety of different noise models we will 
show that even when all the samples are not close to 
the manifold it is possible to "clean" the samples (es- 
sentially removing those in regions of low-density) and 
be left with samples which are dense in a thin region 
around the manifold. 

In the case of additive noise with general noise dis- 
tributions however, we cannot expect too many sam- 
ples to fall close to the manifold. We will show that 
when the noise distribution is known one can use a 
statistical deconvolution procedure to obtain a "de- 
convolved measure" concentrated around the manifold 
from which we can in turn draw a small number of 
samples and apply the cleaning procedure described 
above to them. Deconvolution has been extensively 
studied in the statistical literature (see [9] and refer- 
ences therein). Most related to our application is the 
work of Koltchinskii [TS] who uses deconvolution to 
estimate the dimension and cluster tree of a distribu- 
tion supported on a submanifold. We defer a detailed 
comparison to Section 5.4.1 after the necessary
preliminaries have been introduced.

To the best of our knowledge, ours is the first paper to 
obtain lower and upper minimax bounds for the prob- 
lem of inferring the homology of a manifold. There are 
a few existing results on upper bounds. A summary 
of previous results and the results in this paper are in 
Table 1. 

Outline. In Section 2 we describe the statistical model.
In Section 3 we give a brief description of homology. In
Section 4 we give an overview of our techniques. We
derive the minimax rates for the four noise settings
in Section 5. Technical proofs are contained in the
Appendix.

2 Statistical Model 

We assume that the sample $\{Y_1, \ldots, Y_n\} \subset \mathbb{R}^D$
constitutes a set of "noisy" observations of an unknown
$d$-dimensional manifold $M$, with $d < D$, whose
homology we seek to estimate. The distribution of the
sample depends on the properties of the manifold M 
as well as on the type of sampling noise, which we de- 
scribe below by formulating various statistical models 
for sampling data from manifolds. 

Notation. We let $B_r^{(k)}(x)$ denote a $k$-dimensional ball
of radius $r$ centered at $x$. When $k = D$, we write $B_r(x)$
instead of $B_r^{(D)}(x)$. For any set $M$ and any $\sigma > 0$ define
$\mathrm{tube}_\sigma(M) = \bigcup_{x \in M} B_\sigma(x)$. Let $v_k$ denote the volume
of the $k$-dimensional unit ball. Finally, for clarity we
let $c_1, c_2, \ldots, C_1, C_2, \ldots$ denote various positive
constants whose value can be different in different
expressions. The constants will be specified in the
corresponding proofs.

Manifold Assumptions. We assume that the unknown
manifold $M$ is a $d$-dimensional smooth compact
Riemannian manifold without boundary embedded in
the compact set $\mathcal{X} = [0, 1]^D$. We further assume that
the volume of the manifold is bounded from above by a
constant which can depend on the dimensions $d$, $D$, i.e.
we assume $\mathrm{vol}(M) \le C_{D,d}$. Compact $d$-dimensional
manifolds without boundary typically reside in an
ambient dimension $D > d$, an assumption we will make
throughout this paper. The main regularity condition
we impose on $M$ is that its condition number be not
too small. The condition number $\kappa(M)$ (see [16]) is
the largest number $\tau$ such that the open normal bundle
about $M$ of radius $r$ is imbedded in $\mathbb{R}^D$ for every $r < \tau$.
For $\tau > 0$ let $\mathcal{M} = \mathcal{M}(\tau) = \{M : \kappa(M) \ge \tau\}$ denote
the set of all such manifolds with condition number no
smaller than $\tau$. A manifold with large condition number
does not come too close to being self-intersecting.
We consider the collection $\mathcal{P} = \mathcal{P}(\mathcal{M}) = \mathcal{P}(\mathcal{M}, a)$ of
all probability distributions supported over manifolds
$M$ in $\mathcal{M}$ having densities $p$ with respect to the volume
form on $M$ uniformly bounded from below by a
constant $a > 0$, i.e. $0 < a \le p(x) < \infty$ for all $x \in M$.
For expositional clarity we treat $a$ as a fixed constant
although our upper and lower bounds match in their
dependence on $a$.

The Noise Models. We consider four noise models 
and, for each of them, we specify a class Q of proba- 
bility distributions for the sample. 

Noiseless. We observe data $Y_1, \ldots, Y_n \sim P$ where $P \in \mathcal{P}$.
In this case, $\mathcal{Q} = \mathcal{Q}(\tau) = \mathcal{P}$.

Clutter Noise. We observe data $Y_1, \ldots, Y_n$ from the
mixture $Q = (1 - \pi)U + \pi P$ where $P \in \mathcal{P}$, $0 < \pi \le 1$
and $U$ is a uniform distribution on $\mathcal{X}$. The points
drawn from $U$ are called background clutter. Then
$\mathcal{Q} = \mathcal{Q}(\pi, \tau) = \{Q = (1 - \pi)U + \pi P : P \in \mathcal{P}\}$. Notice
that $\pi = 1$ reduces to the noiseless case.

Tubular Noise. We observe $Y_1, \ldots, Y_n \sim Q_{M,\sigma}$ where
$Q_{M,\sigma}$ is uniform on a tube of size $\sigma$ around $M$. In this
case $\mathcal{Q} = \mathcal{Q}(\sigma, \tau) = \{Q_{M,\sigma} : M \in \mathcal{M}\}$.

Additive Noise. The data are of the form $Y_i = X_i + \epsilon_i$,
where $X_1, \ldots, X_n \sim P$, for some $P \in \mathcal{P}$, and $\epsilon_1, \ldots, \epsilon_n$
are a sample from a noise distribution $\Phi$. Note that
$Q = P \star \Phi$, that is, $Q$ is the convolution of $P$ and $\Phi$.
We consider two cases:

Table 1: Summary of our contributions

1. $\Phi$ is a $D$-dimensional Gaussian with mean
$(0, \ldots, 0)$ and covariance $\sigma^2 I$, with $\sigma \ll \tau$. Define

2. $\Phi$ is any known noise distribution whose Fourier
transform is bounded away from 0, but with the
added restriction that we only consider manifolds
with $\tau$ being a fixed constant. Then
$\mathcal{Q} = \mathcal{Q}(\Phi) = \{Q = P \star \Phi : P \in \mathcal{P}_\tau\}$, where $\mathcal{P}_\tau$ is the subset of
$\mathcal{P}$ comprised of distributions supported on manifolds
$M$ with condition number at least as large
as the fixed value $\tau$.
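The noise models above can be made concrete with a small simulation. The following is a minimal sketch, assuming the manifold is a circle of radius 0.25 centered in the unit square (so $d = 1$, $D = 2$) and that $P$ is uniform on it; the function names and default parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

CENTER = np.array([0.5, 0.5])   # assumed center of the circle inside [0, 1]^2
RADIUS = 0.25                   # assumed radius of the circle


def sample_circle(n, rng):
    """Noiseless model: n points drawn uniformly from the circle."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return CENTER + RADIUS * np.column_stack([np.cos(theta), np.sin(theta)])


def sample_clutter(n, pi, rng):
    """Clutter model: mixture (1 - pi) * Uniform([0, 1]^2) + pi * P."""
    from_manifold = rng.random(n) < pi
    points = rng.uniform(0.0, 1.0, size=(n, 2))
    points[from_manifold] = sample_circle(int(from_manifold.sum()), rng)
    return points


def sample_tubular(n, sigma, rng):
    """Tubular model: uniform on the annulus of half-width sigma around the circle."""
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    r_in, r_out = RADIUS - sigma, RADIUS + sigma
    # Squared radius uniform on [r_in^2, r_out^2] gives an area-uniform sample.
    rho = np.sqrt(rng.uniform(r_in ** 2, r_out ** 2, size=n))
    return CENTER + rho[:, None] * np.column_stack([np.cos(theta), np.sin(theta)])


def sample_additive_gaussian(n, sigma, rng):
    """Additive model: Y_i = X_i + eps_i with eps_i ~ N(0, sigma^2 I)."""
    return sample_circle(n, rng) + sigma * rng.standard_normal((n, 2))
```

For example, `sample_clutter(1000, 0.8, np.random.default_rng(0))` draws roughly 80% of its points from the circle and the rest as background clutter.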

The noise model used in [ITj is to take the noise at any 
point to be only along the normal fibres; this seems 
unnatural and we will not consider that model here. 

In almost all of the distribution classes considered we
allow for $\tau$ to vanish as $n$ gets bigger, which is
equivalent to letting the difficulty of the statistical problem
increase with the sample size. To this end, we will also
analyze the quantity $\tau_n = \tau_n(\epsilon) = \inf\{\tau : R_n \le \epsilon\}$,
which corresponds to the smallest condition number
that permits accurate estimation. We call this the
resolution.

3 Homology 

Often in our paper we will use phrases like "the homol- 
ogy of the union of balls around samples" . In this sec- 
tion we explain this usage and discuss briefly simplicial 
homology (see Hatcher (2001) for a detailed treatment) 
and its computation. 

The homology $\mathcal{H}$ of a space $S$ is a collection of groups
that correspond to topological features of $S$. In what
follows, it might help the reader's intuition to imagine
that we are starting with a dense sample of points $U$ on
a manifold and building a collection of simplices from
these points. The union of balls $\bigcup_{y \in U} B_\epsilon(y)$ gives a
geometric approximation to the underlying manifold.
This is however a continuous (infinite) collection of
points. To make computation tractable we need to be
able to reduce the computation of homology from a
continuous space to its discretization. The Čech complex
(a particular simplicial complex, see Figure 3),
which is described below, gives a discrete representation
of the union of balls. A classic result in topology
called the Nerve Theorem [TT] states that the homology
of $\bigcup_{y \in U} B_\epsilon(y)$ is identical to the homology of the
corresponding Čech complex.

We now describe a simplicial complex and its homology.
A simplicial complex is a hereditary set system $\mathcal{K}$
over a vertex set $V$, i.e. $\sigma \subset \sigma' \in \mathcal{K}$ implies that $\sigma \in \mathcal{K}$.
The dimension of a simplex $\sigma$ is $|\sigma| - 1$; singletons are
0-simplices or vertices, pairs in $\mathcal{K}$ are 1-simplices or
edges, triples are 2-simplices or triangles, etc. A
$p$-chain is a formal sum of $p$-simplices. The coefficients
are taken in $\mathbb{Z}_2$, the integers mod 2.^1 Thus, chains may
be viewed as subsets of simplices and addition (mod
2) as symmetric difference of sets. Addition of chains
forms an abelian group called the chain group $C_p$ with
$0$ denoting the empty chain.

A $p$-simplex $\sigma = \{v_0, \ldots, v_p\}$ has $p + 1$ simplices of
dimension $p - 1$ on its boundary, denoted $\sigma_i = \sigma \setminus \{v_i\}$.
The boundary of a simplex is $\partial_p \sigma = \sum_{i=0}^{p} \sigma_i$. The
boundary operator $\partial_p : C_p \to C_{p-1}$ is the natural
extension of the boundary of a simplex to the boundary
of a chain: $\partial_p c = \sum_{\sigma \in c} \partial_p \sigma$.

The kernel and image of the boundary operator are
two important subgroups of the chain group: the cycle
group $Z_p = \ker \partial_p = \{z \in C_p : \partial_p z = 0\}$, and the
boundary group $B_p = \mathrm{im}\, \partial_{p+1} = \{\partial_{p+1} c : c \in C_{p+1}\}$.
The cycles $Z_p$ are those chains that have boundary 0.
The boundary cycles $B_p$ are those $p$-chains that are the
boundary of some $(p + 1)$-chain. It is easy to check that
$\partial_{p-1} \partial_p c = 0$ and thus $B_p \subset Z_p \subset C_p$. See Figure 1.

Two cycles $z_1, z_2 \in Z_p$ are homologous if $z_1 - z_2 \in B_p$,
i.e. their difference is the boundary of a $(p + 1)$-chain.
The $p$th homology group $H_p$ is defined as the
quotient group $Z_p / B_p$. That is, the homology group is
a collection of equivalence classes of cycles. The first
homology group $H_0$ corresponds to connected components
(clusters). The next homology group $H_1$
corresponds to cycles (or loops). Higher order homology
groups correspond to equivalence classes of higher
dimensional cycles.^2 The homology of $\mathcal{K}$ is the collection
$\mathcal{H}$ of all its homology groups.

The Čech complex is a specific simplicial complex
defined as follows. Fix some $\epsilon > 0$ and a set of points
$S \subset \mathbb{R}^D$. The Čech complex consists of all simplices
$\sigma$ such that $\bigcap_{x \in \sigma} B_\epsilon(x) \neq \emptyset$, where $B_\epsilon(x)$ is a ball of
radius $\epsilon$ centered at $x$. See Figure 3.
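The definition above can be sketched in code for simplices up to dimension 2. This is a minimal illustration, not a general implementation: it assumes closed Euclidean balls and uses the fact that three balls of radius $\epsilon$ share a point exactly when the smallest ball enclosing their three centers has radius at most $\epsilon$. The function names are hypothetical.

```python
import math
from itertools import combinations


def miniball_radius3(p, q, r):
    """Radius of the smallest ball enclosing three points."""
    a, b, c = sorted([math.dist(q, r), math.dist(p, r), math.dist(p, q)], reverse=True)
    if a * a >= b * b + c * c:
        # Right, obtuse, or degenerate triangle: the longest side is a diameter.
        return a / 2.0
    # Acute triangle: the smallest enclosing ball is the circumscribed ball.
    s = (a + b + c) / 2.0
    area = math.sqrt(s * (s - a) * (s - b) * (s - c))
    return a * b * c / (4.0 * area)


def cech_complex(points, eps):
    """Vertices, edges and triangles of the Cech complex at radius eps."""
    n = len(points)
    vertices = [(i,) for i in range(n)]
    # Two balls of radius eps intersect iff their centers are within 2 * eps.
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if math.dist(points[i], points[j]) <= 2.0 * eps]
    edge_set = set(edges)
    triangles = [t for t in combinations(range(n), 3)
                 if all(e in edge_set for e in combinations(t, 2))
                 and miniball_radius3(*(points[i] for i in t)) <= eps]
    return vertices, edges, triangles
```

For an equilateral triangle with unit sides the circumradius is $1/\sqrt{3} \approx 0.577$, so at `eps = 0.55` the complex has all three edges but no 2-simplex, while at `eps = 0.6` the triangle is filled in.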



^1 In general, homology may be defined over any ring, but
we stick with $\mathbb{Z}_2$ for ease of exposition and computation.

^2 Intuitively, boundary cycles are "filled in" cycles and
two cycles are homologous if one cycle can be deformed
into the other cycle.

Figure 1: Relationship between chains $C_p$, cycles $Z_p = \ker \partial_p$ and boundaries $B_p = \mathrm{im}\, \partial_{p+1}$. The chains $C_p$
are just collections of simplices. The chains in $Z_p$ are the cycles. The cycles in $B_p$ are the cycles that happen to
be boundaries of chains in $C_{p+1}$.




Figure 2: The sum of two 1-cycles is another 1-cycle.
Here the cycles are homologous because their sum (in
$\mathbb{Z}_2$) is the boundary of a 2-chain of triangles.




Figure 3: A union of balls and its corresponding Čech complex.



Since the coefficient ring is a field, the computations
may be completely described by linear algebra. The
groups $C_p$, $Z_p$, $B_p$, and $H_p$ are vector spaces and the
boundary operators are linear maps. It is possible to
efficiently compute the homology groups of a simplicial
complex in time polynomial in the size of the complex.
The algorithm only involves row reduction on the
matrix representations of $\partial_p$.
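The row-reduction computation described here can be sketched as follows. This is a minimal illustration over $\mathbb{Z}_2$ using dense matrices, not one of the fast algorithms used in practice; it computes the Betti numbers $\dim H_p = \dim C_p - \mathrm{rank}\, \partial_p - \mathrm{rank}\, \partial_{p+1}$. The function names and the input convention (sorted vertex tuples grouped by dimension) are assumptions made for this example.

```python
import numpy as np
from itertools import combinations


def rank_mod2(matrix):
    """Rank of a 0/1 matrix over the field Z_2, by Gaussian elimination."""
    m = np.array(matrix, dtype=np.uint8) % 2
    rows, cols = m.shape
    rank = 0
    for col in range(cols):
        if rank == rows:
            break
        pivots = np.nonzero(m[rank:, col])[0]
        if pivots.size == 0:
            continue
        pivot = rank + pivots[0]
        m[[rank, pivot]] = m[[pivot, rank]]      # move the pivot row into place
        for r in range(rows):
            if r != rank and m[r, col]:
                m[r] ^= m[rank]                  # addition mod 2 is XOR
        rank += 1
    return rank


def boundary_matrix(faces, simplices):
    """Matrix of the boundary operator from p-simplices to (p-1)-simplices."""
    index = {face: i for i, face in enumerate(faces)}
    mat = np.zeros((len(faces), len(simplices)), dtype=np.uint8)
    for j, simplex in enumerate(simplices):
        for face in combinations(simplex, len(simplex) - 1):
            mat[index[face], j] = 1
    return mat


def betti_numbers(simplices_by_dim):
    """simplices_by_dim[p] lists the p-simplices as sorted vertex tuples."""
    top = len(simplices_by_dim)
    ranks = [0]  # the boundary of a vertex is zero
    for p in range(1, top):
        ranks.append(rank_mod2(boundary_matrix(simplices_by_dim[p - 1], simplices_by_dim[p])))
    ranks.append(0)  # no simplices above the top dimension
    return [len(simplices_by_dim[p]) - ranks[p] - ranks[p + 1] for p in range(top)]
```

A hollow triangle (three vertices, three edges) has one connected component and one loop, and filling it in with a 2-simplex removes the loop.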

4 Techniques 

4.1 Techniques for lower bounds 

The total variation distance between two measures $P$
and $Q$ is defined by $\mathrm{TV}(P, Q) = \sup_A |P(A) - Q(A)|$
where the supremum is over all measurable sets. It
can be shown that $\mathrm{TV}(P, Q) = P(G) - Q(G) = 1 - \int \min(p, q)$
where $G = \{y : p(y) > q(y)\}$ and $p$ and
$q$ are the densities of $P$ and $Q$ with respect to any
measure $\mu$ that dominates both $P$ and $Q$.
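The three expressions for the total variation distance can be checked numerically on a small discrete example. The sketch below assumes distributions on a finite set represented as probability vectors, with counting measure as the dominating measure; the function names are hypothetical.

```python
import numpy as np
from itertools import chain, combinations


def tv_by_sup(p, q):
    """sup over all subsets A of |P(A) - Q(A)| (brute force, small supports only)."""
    idx = range(len(p))
    subsets = chain.from_iterable(combinations(idx, k) for k in range(len(p) + 1))
    return max(abs(sum(p[i] - q[i] for i in subset)) for subset in subsets)


def tv_by_set_g(p, q):
    """P(G) - Q(G) with G = {y : p(y) > q(y)}."""
    g = p > q
    return p[g].sum() - q[g].sum()


def tv_by_min(p, q):
    """1 - sum of min(p, q)."""
    return 1.0 - np.minimum(p, q).sum()


p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
print(tv_by_sup(p, q), tv_by_set_g(p, q), tv_by_min(p, q))
```

All three values agree up to floating-point rounding for this pair of distributions.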



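We shall make repeated use of Le Cam's lemma, which
we now state (see, e.g., [H]).

Lemma 1 (Le Cam). Let $\mathcal{Q}$ be a set of distributions.
Let $\theta(Q)$ take values in a metric space with metric $\rho$.
Let $Q_1, Q_2 \in \mathcal{Q}$ be any pair of distributions in $\mathcal{Q}$. Let
$Y_1, \ldots, Y_n$ be drawn iid from some $Q \in \mathcal{Q}$ and denote
the corresponding product measure by $Q^n$. Then

$$\inf_{\hat\theta} \sup_{Q \in \mathcal{Q}} \mathbb{E}_{Q^n}\big[\rho(\hat\theta, \theta(Q))\big] \ge \frac{1}{8}\, \rho(\theta(Q_1), \theta(Q_2))\, (1 - \mathrm{TV}(Q_1, Q_2))^{2n}$$

where the infimum is over all estimators.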

Le Cam's lemma makes precise the intuition that if 
there are distinct members of the class Q for which 
the data generating distributions are close then the 
statistical problem is hard given a small sample. 
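To see how the lemma turns closeness of two distributions into a sample size requirement, the bound can be evaluated numerically. This is a sketch only: it assumes the lower bound has the form $c\, \rho\, (1 - \mathrm{TV})^{2n}$, and the constant $c = 1/8$ used as a default is an assumption of this example (it is the value commonly quoted for this form of Le Cam's lemma). The function names are hypothetical.

```python
import math


def le_cam_lower_bound(tv, n, rho=1.0, const=1.0 / 8.0):
    """Lower bound const * rho * (1 - tv)^(2n) on the minimax risk."""
    return const * rho * (1.0 - tv) ** (2 * n)


def min_samples_for_risk(tv, eps, const=1.0 / 8.0):
    """Smallest n at which the lower bound (with rho = 1) drops to eps or below.

    No estimator can achieve risk eps with fewer samples than this.
    """
    if not (0.0 < tv < 1.0) or not (0.0 < eps < const):
        raise ValueError("need 0 < tv < 1 and 0 < eps < const")
    return math.ceil(math.log(const / eps) / (-2.0 * math.log(1.0 - tv)))
```

The smaller the total variation distance between the two hypotheses, the more samples are needed before the bound permits a small risk: `min_samples_for_risk(0.01, 0.01)` is roughly ten times `min_samples_for_risk(0.1, 0.01)`.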

When we apply Le Cam's lemma in this paper, $Q_1$
and $Q_2$ will be associated with two different manifolds
$M_1$ and $M_2$. We will take $\theta(Q)$ to be the homology of
the manifold and $\rho(\theta(Q_1), \theta(Q_2)) = 1$ if the homologies
are different and $\rho(\theta(Q_1), \theta(Q_2)) = 0$ if the
homologies are the same. The subtlety of establishing
tight lower bounds boils down to the task of finding a
set of distributions in the class $\mathcal{Q}$ for which the homologies
of the underlying submanifolds are distinct but
whose empirical distributions are hard to distinguish
from a small number of samples.

We will use two representative manifolds $M_1$ and $M_2$
in the application of Le Cam's lemma, which we describe
here. See Figure 4. The manifold $M_1$ is a pair
of $d$-balls of radius $1 - \tau$ (shown in blue) embedded $2\tau$ apart
in $\mathbb{R}^D$, joined smoothly at their ends (shown in red).
The manifold $M_2$ is a pair of $d$-annuli (shown in blue)
embedded $2\tau$ apart with outer radius $1 - \tau$ and inner
radius $4\tau$, smoothly joined at both the inner and outer
ends (shown in red). It is clear from the construction
that both these manifolds are $d$-dimensional, compact,
have no boundary and have condition number $\tau$. It is
also the case that $\mathcal{H}(M_1) \neq \mathcal{H}(M_2)$.



Figure 4: The two manifolds $M_1$ and $M_2$, with $d = 1$, $D = 2$.

If there exist two manifolds $M_1$ and $M_2$ with corresponding
distributions $Q_1$ and $Q_2$ in $\mathcal{Q}$ such that (i)
$\mathcal{H}(M_1) \neq \mathcal{H}(M_2)$ and (ii) $Q_1 = Q_2$ then we say that
the model $\mathcal{Q}$ is non-identifiable. In this case, recovering
the homology is impossible and we write $R_n = 1$
and $n(\epsilon) = \infty$.

4.2 Techniques for upper bounds 

To establish an upper bound we need to construct an 
estimator that achieves the upper bound. In the noise- 
less and tubular noise cases the samples are in a thin 
region around the manifold and our estimator is con- 
structed from a union of balls (of a carefully chosen 
radius) around the sample points. 

In the case of clutter noise and additive Gaussian noise 
samples are concentrated around the manifold but a 
few samples may be quite far away from the manifold. 
In these cases our upper bounds are obtained by analyzing
the performance of Algorithm 1 (CLEAN)
with a carefully specified threshold and radius, which 
is used to remove points in regions of low density far 
away from the manifold. Our estimator is then con- 
structed from a union of balls around the remaining 
points.

Algorithm 1 CLEAN

• IN: $(X_i)_{i=1}^n$, threshold $t$, radius $r$

1. Construct a graph $G_r$ with nodes $\{X_i\}_{i=1}^n$. Include
edge $(X_i, X_j)$ if $\|X_i - X_j\| < r$.

2. Mark all vertices with degree $d_i < (n - 1)t$.

• OUT: All unmarked vertices

In the case of additive noise with a general
known distribution the samples are not expected to
be concentrated around the manifold. We will first 
use deconvolution to estimate a deconvolved measure 
Pn which we will show is densely concentrated in a 
thin region around the manifold. We will then draw 
samples from this measure, clean them and construct 
a union of balls of appropriate radius around the re- 
maining samples, and show that this set has the right 
homology with high probability. 
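The CLEAN procedure (Algorithm 1) can be sketched directly from its two steps. This is a minimal illustration that builds the full pairwise distance matrix, so it uses memory quadratic in the number of points; the function name and the choice to return indices rather than points are assumptions of this example.

```python
import numpy as np


def clean(points, t, r):
    """Return indices of points whose r-neighborhood graph degree is at least (n - 1) * t."""
    x = np.asarray(points, dtype=float)
    n = len(x)
    # Step 1: graph with an edge between X_i and X_j whenever ||X_i - X_j|| < r.
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    adjacency = dists < r
    np.fill_diagonal(adjacency, False)   # no self-loops
    degree = adjacency.sum(axis=1)
    # Step 2: mark vertices of low degree, i.e. points in regions of low density.
    marked = degree < (n - 1) * t
    return np.flatnonzero(~marked)
```

For instance, with 20 points on a small circle and 2 isolated points far away, `clean(..., t=0.2, r=0.5)` keeps the 20 clustered points and discards the 2 isolated ones.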

We now briefly review statistical deconvolution. We
refer the interested reader to the work of Fan for
more details and to [15] for an application related to
ours. The procedure is similar to kernel density estimation
with a kernel modified to account for the additive
noise. For symmetric noise distributions $\Phi$, we
consider two kernels $K$ and $\Psi$ such that $K \star \Phi = \Psi$,
where $\star$ denotes convolution. The deconvolution estimator
is $\hat{P}_n(A) = \frac{1}{n} \sum_{i=1}^{n} K(Y_i - A)$. It is easy
to verify that $\mathbb{E} \hat{P}_n = P \star \Psi$, similar to regular kernel
density estimation with the kernel $\Psi$. In the noiseless
case we can even take $K = \Psi = \delta_0$ (a Dirac at 0)
and get back the empirical distribution of the sample.
More generally, we will be interested in $\Psi$ that satisfies
$\Psi\{x : |x| > \epsilon\} < \gamma$ for $\epsilon$ and $\gamma$ that we will later
specify.
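As a concrete instance of the kernel identity $K \star \Phi = \Psi$, consider the Gaussian case. The worked example below is a sketch under the assumption, not made in the text, that both $\Phi$ and $\Psi$ are isotropic Gaussians; the bandwidth symbol $s$ is introduced here for illustration only. It shows that the required kernel is obtained by dividing Fourier transforms, which is why the Fourier transform of the noise must stay away from zero.

```latex
\begin{align*}
\Phi &= N(0, \sigma^2 I_D), \qquad \Psi = N(0, s^2 I_D), \quad s > \sigma, \\
K &= N\big(0, (s^2 - \sigma^2) I_D\big)
  \;\Longrightarrow\;
  K \star \Phi = N\big(0, (s^2 - \sigma^2 + \sigma^2) I_D\big) = \Psi, \\
K^*(t) &= \frac{\Psi^*(t)}{\Phi^*(t)}
  = \frac{e^{-s^2 |t|^2 / 2}}{e^{-\sigma^2 |t|^2 / 2}}
  = e^{-(s^2 - \sigma^2) |t|^2 / 2}.
\end{align*}
```

In this special case $K$ is itself a probability density; for a target $\Psi$ more concentrated than the noise, the quotient $\Psi^* / \Phi^*$ still defines $K$ through its Fourier transform, but $K$ need not be a density.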

In each of the above cases our final estimator is constructed
from a union of balls around appropriate
points, and our theorems will show that these have the
correct homology with high probability. To compute
the homology one would construct the corresponding
Čech complex and compute its "boundary matrices"
(as described in Section 3). Recovering the homology
from these matrices consists of linear algebraic manipulation.
There are several fast algorithms to compute
the homology (either exactly jjj or approximately [5])
of the Čech complexes from large point sets in high
dimensions.

5 Minimax Rates 

We now derive the minimax rates for homology estimation
under the four noise models described in Section 2.
There are three quantities of interest: the
minimax risk $R_n$, the resolution $\tau_n$ and the sample
complexity $n(\epsilon)$. We write $R_n \asymp a_n$ (similarly for
$\tau_n \asymp a_n$) if there are positive constants $c$ and $C$ such
that $c \le R_n / a_n \le C$ for all large $n$. Similarly, we write
$n(\epsilon) \asymp a(\epsilon)$ if there are positive constants $c$ and $C$ such
that $c \le n(\epsilon) / a(\epsilon) \le C$ for all small $\epsilon$. Our analysis
will show that the rates (as a function of $n$) are typically
polynomial for the resolution and exponential for
the risk. We will often match upper and lower bounds
on sample complexity and resolution only up to logarithmic
factors, and correspondingly those on the risk
up to polynomial factors. In this case we will use the
notation $R_n \asymp^* a_n$, $\tau_n \asymp^* a_n$, and $n(\epsilon) \asymp^* a(\epsilon)$.

It is worth emphasizing at this point that despite the
fact that we use two specific manifolds in the application
of Le Cam's lemma, the resulting lower bound
holds for all manifolds in $\mathcal{M}$ and all distributions in
$\mathcal{Q}$. Le Cam's lemma allows one to get a lower bound
that holds for any estimator by using two carefully 
chosen distributions in Q. The upper bounds are from 
specific estimators and they establish an upper bound 
on the number of samples to estimate the homology of 
any manifold in our class. 

5.1 Noiseless Case 

Theorem 1. For all $\tau \le \tau_0(a, d)$, in the noiseless
case the minimax rate is $R_n \asymp^* e^{-n\tau^d}$, where $\tau_0(a, d)$ is
a constant which depends on $a$ and $d$. Also, $n(\epsilon) \asymp^* \tau^{-d} \log(1/\epsilon)$
and $\tau_n \asymp^* ((1/n) \log(1/\epsilon))^{1/d}$.

We provide proof sketches for the lower and upper
bounds on $R_n$ separately.

Lower Bound: Proof Sketch 

To obtain a lower bound on the minimax risk over the
class $\mathcal{Q}(\tau)$ we will consider the two carefully chosen
manifolds $M_1$ and $M_2$ described earlier.

We further need to specify the density on each of the
manifolds, and we choose two densities from $\mathcal{P}$ so that
the data distributions are as similar as possible while
respecting the constraint $p(x) \ge a$. The construction
is described in more detail in Appendix A.1.1, but
for now it suffices to notice that the two densities can
be constructed to differ only on the sets $W_1 = M_1 \setminus M_2$
and $W_2 = M_2 \setminus M_1$ and can be made as low as $a$ on
one of these sets. A straightforward calculation shows
that

$$\mathrm{TV}(p_1, p_2) \le a \max(\mathrm{vol}(W_1), \mathrm{vol}(W_2)) \le C_d\, a\, \tau^d$$

where the constant $C_d$ depends on $d$. Now, we apply
Le Cam's lemma to obtain that

$$R_n \ge \frac{1}{8} (1 - C_d\, a\, \tau^d)^{2n} \ge \frac{1}{8}\, e^{-4 C_d n a \tau^d}$$

for all $\tau \le \tau_0(a, d)$, where $\tau_0(a, d)$ is a constant depending
on $a$ and $d$; the second inequality uses $1 - x \ge e^{-2x}$ for
$0 \le x \le 1/2$. The lower bound of Theorem 1 follows.

Upper Bound: Proof Sketch 

In the noiseless case the samples are densely concentrated
around the manifold and our estimator is constructed
from a union of balls of radius $\tau/2$ around
the sample points. The upper bound on the minimax
risk follows from a straightforward modification
of the results of [TB]. For completeness, we reproduce
an adaptation of their main homology inference theorem
(Theorem 3.1) here.

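Lemma 2. [NSW] Let $0 < \epsilon < \tau$ and let
$U = \bigcup_{i=1}^{n} B_\epsilon(X_i)$. Let $\hat{\mathcal{H}} = \mathcal{H}(U)$. Let
$\theta_1 = \sin^{-1}(\epsilon / (8\tau))$ and $\theta_2 = \sin^{-1}(\epsilon / (16\tau))$.
Then for all $n > \zeta_1 (\log(\zeta_2) + \log(1/\delta))$, $P(\hat{\mathcal{H}} \neq \mathcal{H}(M)) < \delta$,
where $\zeta_1$ and $\zeta_2$ depend on $\mathrm{vol}(M)$, $a$, $\epsilon$, $\theta_1$ and $\theta_2$.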

By assumption $\mathrm{vol}(M) \le C_{D,d}$ for some constant $C_{D,d}$
depending on $d$ and $D$. To obtain a sample complexity
bound we simply choose the ball radius in Lemma 2 to be $\tau/2$, and this gives
us $n(\epsilon) \le \frac{C_1}{a\tau^d} \big(C_2 \log(1/(a\tau^d)) + \log(1/\epsilon)\big)$, which
matches the lower bound up to the factor of $\log(1/\tau)$.

Further calculation (see Appendix A.1.1) then shows
that, as desired, $R_n \le \frac{C_1}{\tau^d} \exp(-C_2 n a \tau^d)$ for appropriate
constants $C_1, C_2$, together with a corresponding upper bound on $\tau_n$.
This establishes Theorem 1.
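The link between the risk bound and the sample complexity is a one-line inversion. The worked step below is a sketch that assumes the upper bound has the form $R_n \le (C_1 / \tau^d) \exp(-C_2 n a \tau^d)$ with unspecified positive constants $C_1, C_2$, and solves for the smallest $n$ that drives this bound below a target risk $\epsilon$.

```latex
\begin{align*}
\frac{C_1}{\tau^d} \exp(-C_2\, n\, a\, \tau^d) \le \epsilon
&\iff C_2\, n\, a\, \tau^d \ge \log\frac{C_1}{\tau^d} + \log\frac{1}{\epsilon} \\
&\iff n \ge \frac{1}{C_2\, a\, \tau^d}
  \left( \log\frac{C_1}{\tau^d} + \log\frac{1}{\epsilon} \right).
\end{align*}
```

The leading factor $1/(a\tau^d)$ and the $\log(1/\epsilon)$ term match the lower bound, and the extra $\log(1/\tau^d)$ term is the logarithmic gap between the upper and lower bounds.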

5.2 Clutter Noise 

Theorem 2. For all $\tau \le \tau_0(a, d)$, in the clutter noise
case, $R_n \asymp^* e^{-n\pi\tau^d}$, where $\tau_0(a, d)$ is a constant which
depends on $a$ and $d$. Also, $n(\epsilon) \asymp^* (1/(\pi\tau^d)) \log(1/\epsilon)$
and $\tau_n \asymp^* ((1/(n\pi)) \log(1/\epsilon))^{1/d}$.

Lower Bound: Proof Sketch 

The lower bound for the class $\mathcal{Q}(\pi, \tau)$ follows via the
same construction as in the noiseless case. In the calculation
of the total variation distance (see Appendix
A.1.2) we have instead

$$\mathrm{TV}(q_1, q_2) \le \pi a \max(\mathrm{vol}(W_1), \mathrm{vol}(W_2)) \le C_d\, \pi\, a\, \tau^d$$

where $C_d$ depends on $d$. As before the lower bound
follows from the application of Le Cam's lemma.

Upper Bound: Proof Sketch 

As a preliminary step we clean the data samples to
eliminate points that are far away from the manifold while retaining
those close to it. Our analysis shows that
Algorithm 1 will achieve this, with high probability, for
a carefully chosen threshold and radius. We then show
that taking a union of balls of the appropriate radius
around the remaining points will give us the correct
homology, with high probability. We give an outline
here and defer details to Appendix A.1.2.



1. We define two regions $A = \mathrm{tube}_r(M)$ and
$B = \mathbb{R}^D \setminus \mathrm{tube}_{2r}(M)$, where $r < (\sqrt{9} - \sqrt{8})\tau/2$.

2. We then invoke Algorithm CLEAN on the data
with a carefully chosen threshold $t$
and radius $2r$. Let $I$ be the set of vertices returned.

3. Through careful analysis we show that with high
probability $I$ contains all the vertices from the
region $A$ and none of the points in region $B$.

4. We further show that the retained points form a
thin dense cover of the manifold $M$, i.e.
$M \subset \bigcup_{i \in I} B_{2r}(X_i)$.

5. Using a straightforward corollary of Lemma 2 we
show that this thin dense cover can be used to
recover the homology of $M$ with high probability.



Formally, in Appendix A.1.2 we prove the following
lemma.

Lemma 3. If $n > \max(N_1, N_2)$ and $r < (\sqrt{9} - \sqrt{8})\tau/2$,
where $N_1 = 4K \log(K)$, the quantities $K$ and $N_2$ are
expressed in terms of $\mathrm{vol}(M)$ and
$\zeta = n a v_d r^d \cos^d(\theta)$, and $\theta = \sin^{-1}(r/2\tau)$, then
after cleaning the points $\{X_i : i \in I\}$ are all
in $\mathrm{tube}_{2r}(M)$ and are $2r$ dense in $M$. Let
$U = \bigcup_{i \in I} B_w(X_i)$ with $w = r + \tau/2$ and let $\hat{\mathcal{H}} = \mathcal{H}(U)$. We
have that $\hat{\mathcal{H}} = \mathcal{H}(M)$ with probability at least $1 - \delta$.

Taking $r = (\sqrt{9} - \sqrt{8})\tau/4$, we obtain a sample
complexity upper bound on $n(\epsilon)$.
Given this sample complexity upper bound, the upper
bounds on minimax risk and resolution follow by arguments
identical to those of the noiseless case (Appendix A.1.1).



5.3 Tubular Noise 

Under this noise model we get samples uniformly from
a tubular region of width $\sigma$ around the manifold. This
model highlights an important phenomenon in high
dimensions. Although we receive samples uniformly
from a full $D$-dimensional shape, these samples concentrate
tightly around a $d$-dimensional manifold. We
show that with some care we can still reconstruct the
homology at a rate independent of $D$.

Theorem 3. Under the tubular noise model we establish
the following cases.

1. If $\sigma > 2\tau$ then the model is non-identifiable and
hence, $R_n = 1$ and $n(\epsilon) = \infty$.

2. If $\sigma < C_0 \tau$, with $C_0$ small and $\tau \le \tau_0(a, d)$, then
$R_n \asymp^* e^{-n\tau^d}$, where $\tau_0(a, d)$ is a constant which
depends on $a$ and $d$. Also, $n(\epsilon) \asymp^* 1/\tau^d$ and
$\tau_n \asymp^* ((1/n) \log(1/\epsilon))^{1/d}$.

Remark 1. The case when $\sigma$ is very close to $\tau$ is
significantly more involved since it requires the exact
calculation of the volume of the tubular region. Establishing
tight upper and lower bounds here is an open
problem we are attempting to address in current work.

Lower bound: Proof Sketch 

1. When $d < D$ and $\sigma > 2\tau$ the two manifolds $M_1$
and $M_2$ that we have considered thus far are still
identifiable because even when $\sigma > \tau$, $M_2$ has a
"dimple" along the co-dimensions that $M_1$ does
not. To show that the class $\mathcal{Q}$ is still not identifiable
we require a different construction. Consider
the manifolds $M_1$ and $M_2$ with two points
placed above and below the manifold at a distance
$2\tau$ above their centers along each of the
co-dimensions. Denote these new manifolds $M_1'$ and
$M_2'$. It is clear that $\mathcal{H}(M_1') \neq \mathcal{H}(M_2')$, however
$Q_1' = Q_2'$ since the extra points hide the "dimple"
and the two manifolds cannot be distinguished.

2. When $d < D$ and $\sigma < C_0 \tau$ we return to our
old constructions of $M_1$ and $M_2$. There is however
an important difference in that the two manifolds
differ on full $D$-dimensional sets, and one
might suspect that $\mathrm{TV}(q_1, q_2) = O(\tau^D)$ or perhaps
$O(\sigma^{D-d} \tau^d)$. As we show in Appendix A.1.3
however, $\mathrm{TV}(q_1, q_2)$ is still $O(\tau^d)$, and we recover
an identical lower bound to the noiseless case.

Upper bound: Proof Sketch 

We are interested in the case when $\sigma < C_0 \tau$ (in particular
$\sigma < \tau/24$ will suffice). Our proof will involve two main
steps which we sketch here.

1. We first show that if we consider balls of sufficiently
large radius $\epsilon$ (compared to $\sigma$) then the
probability mass in these balls is $O(\epsilon^d)$. This
is a manifestation of the phenomenon alluded
to earlier: inside large enough balls the mass is
concentrated around the lower dimensional manifold.
Precisely, define $k_\epsilon = \inf_{p \in M} Q(B_\epsilon(p))$. In
Lemma 9 in the Appendix, we show that, if $\epsilon/\sigma$
is large, $k_\epsilon$ is of order $\Omega(\epsilon^d)$.

2. There is however a disadvantage to considering
balls that are too large. The
union of balls around the samples may no longer
have the right homology. Using tools from NSW,
we show in the Appendix that we can balance
these two considerations for manifolds with high
condition number, i.e. provided $\sigma < \tau/24$, we can
choose balls that are both large relative to $\sigma$ and
whose union still has the correct homology.

We will prove the following main lemma in the Ap- 
pendix. 

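Lemma 4. Let $N_\epsilon$ be the $\epsilon$-covering number of the
submanifold $M$. Let $U = \bigcup_{i=1}^{n} B_{\epsilon + \tau/2}(X_i)$. Let $\hat{\mathcal{H}} = \mathcal{H}(U)$.
Then if $n > \frac{1}{k_\epsilon} (\log(N_\epsilon) + \log(1/\delta))$, $P(\hat{\mathcal{H}} \neq \mathcal{H}(M)) < \delta$
as long as $\sigma < \epsilon/2$ and $\epsilon < (\sqrt{9} - \sqrt{8})\tau/2$.

Notice that we require $\sigma < (\sqrt{9} - \sqrt{8})\tau/4$, which is satisfied
if $\sigma < \tau/24$ (for instance). To obtain the upper bound
set $\epsilon = 2\sigma$, and observe that $N_\epsilon = O(1/\epsilon^d) = O(1/\tau^d)$
and $k_\epsilon = \Omega(\epsilon^d) = \Omega(\tau^d)$. This gives us that if
$n \ge \frac{C_1}{\tau^d} \big(\log(\frac{C_2}{\tau^d}) + \log(\frac{1}{\delta})\big)$ we recover the right homology
with probability at least $1 - \delta$. The upper bound
on minimax risk and resolution follows from similar
arguments to those made previously.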

5.4 Additive Noise 

For additive noise we consider two cases. In the first
case, we derive the minimax rates for additive Gaussian
noise under the somewhat restrictive assumption
that $C\sqrt{D}\sigma < \tau$. This problem is related to the problem
of separating mixtures of Gaussians (which corresponds
to the case where the manifold is a collection
of points and $2\tau$ is the distance between the closest
pair). In this case we have the following theorem.

Theorem 4. For all $\tau \le \tau_0(a, d)$ and $8\sqrt{D}\sigma < \tau$,
$R_n \asymp^* e^{-n\tau^d}$, where $\tau_0(a, d)$ is a constant which depends
on $a$ and $d$. Also, $n(\epsilon) \asymp^* (1/\tau^d) \log(1/\epsilon)$ and
$\tau_n \asymp^* ((1/n) \log(1/\epsilon))^{1/d}$.

As in the clutter noise case we need to first clean the 
data and then take a union of balls around the points 
which survive. We analyze this procedure in the Ap- 
pendix. 

5.4.1 Deconvolution 

Here we consider more general known noise distributions
but work over the class of distributions $\mathcal{Q}(\Phi)$ over
manifolds with $\tau$ fixed. We first use deconvolution to
estimate a deconvolved measure $\hat{P}_n$ which is concentrated
around the manifold. We then draw samples
from this measure, clean them and construct a union
of balls $H$ around these samples, and show that $H$ has
the right homology with high probability. The
noise distributions we will consider satisfy the following
assumption on their densities.

Assumption 1. Denote $\rho(R) = \inf_{|t|_\infty \le R} |\Phi^*(t)|$,
where $R > 0$, $|t|_\infty = \max_{1 \le j \le D} |t_j|$ and $\Phi^*(t)$ is the
Fourier transform of the symmetric noise density $\Phi$.
We assume $\rho(R) > 0$.

This is a standard assumption in the literature on de- 
convolution (see |9l E]), since as described deconvo- 
lution requires us to divide by the Fourier transform 
of the noise, which needs to be bounded away from zero
for the procedure to be well behaved. The assumption
is satisfied by a variety of noise distributions,
including Gaussian noise. Our main result says that for
this broad class of noise distributions the deconvolu- 
tion procedure described above will achieve an optimal 
rate of convergence. 

Theorem 5. In the additive noise case with $\tau$ fixed,
for $\Phi$ satisfying Assumption 1, $R_n \asymp e^{-n}$. Hence,
$n(\epsilon) \asymp \log(1/\epsilon)$.



Lower Bound: Proof Sketch To obtain the lower
bound one can consider the same construction from
the previous subsection with additive Gaussian noise.
If $\tau$ is taken to be fixed we obtain the desired bound.

Upper Bound: Proof Sketch Our proof of the up- 
per bound follows similar lines to that of Koltchinskii 
[15] . We deviate in two significant aspects. Koltchin- 
skii only assumes an upper bound on the density, 
which he shows is sufficient to estimate weak geomet- 
ric characteristics like the dimension of the manifold. 
To show that we can accurately reconstruct its homol- 
ogy we require both an upper and lower bound and 
our methods are quite different. Koltchinskii uses an 
epsilon net of the entire compact set containing the 
manifold critically in his construction and his proce- 
dure is thus not implementable/practical. Our algo- 
rithm instead draws a small number of samples from 
the deconvolved measure and uses those to estimate 
the homology resulting in a practical procedure. We 
prove the following upper bound in the Appendix. 

Lemma 5. Given $n$ samples from $\mathcal{Q}(\Phi)$ with $\Phi$ satisfying
Assumption 1, there exist $C_1, C_2, c_1 > 0$ such
that $\mathbb{P}(\mathcal{H}(H) \neq \mathcal{H}(M)) \le C_1 e^{-c_1 n}$, where $H$ is a
union of balls of radius $\frac{5\epsilon + \tau}{2}$ centered around $m \ge C_2 n$
samples drawn from the deconvolved measure $\hat{P}_n$ with
a kernel $\Psi$ with parameters $\gamma, \epsilon$ (specified in the proof).
The samples are cleaned using the deconvolved measure
by considering balls of radius $4\epsilon$ at a threshold $2\gamma$.

Remark 2. The cleaning procedure we use here is dif- 
ferent from the Algorithm CLEAN. We remove points 
around which a ball of appropriate radius has low prob- 
ability mass under the deconvolved measure. This is 
equivalent to using the deconvolved measure in place 
of the k-NN density estimate implicitly constructed by 
the CLEAN procedure. 

Simple calculations show that this lemma together
with the lower bound give the exponential minimax
rate described in Theorem 5.

6 Conclusion 

We have given the first minimax bounds for homology 
inference. These bounds give insight into the intrinsic 
difficulty of the problem under various assumptions. 
Our bounds show that it is often possible to estimate 
the homology of a manifold at fast rates independent 
of the ambient dimension. 

Actual implementation of homology inference has be- 
come tractable thanks to advances in computational 
topology. However, as our proofs reveal, recovering 
the homology requires the careful selection of several 
tuning parameters. In current work, we are developing
methods for choosing these parameters in a statistically
sound, data-driven way.

A Appendix: Supplementary Material

A.1 Key technical lemmas from [16]

We will need two technical lemmas from [16].

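Lemma 6 (Ball volume lemma, Lemma 5.3 in
[16]). Let $p \in M$. Now consider $A = M \cap B_\epsilon(p)$.
Then $\mathrm{vol}(A) \ge (\cos\theta)^d\, \mathrm{vol}(B^d_\epsilon(p))$, where $B^d_\epsilon(p)$ is
the $d$-dimensional ball in the tangent space at $p$ and
$\theta = \sin^{-1}\frac{\epsilon}{2\tau}$.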
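To get a feel for the size of the bound in the ball volume lemma, the following minimal Python sketch evaluates $(\cos\theta)^d\, v_d\, \epsilon^d$ with $\theta = \sin^{-1}(\epsilon/2\tau)$. It uses the standard formula $v_d = \pi^{d/2}/\Gamma(d/2 + 1)$ for the volume of the unit $d$-ball. The function names are illustrative and do not come from [16].

```python
import math


def unit_ball_volume(d: int) -> float:
    """Volume v_d of the unit ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def ball_volume_lower_bound(eps: float, tau: float, d: int) -> float:
    """Lower bound (cos theta)^d * v_d * eps^d on vol(M ∩ B_eps(p)),
    with theta = arcsin(eps / (2 * tau)). Requires 0 < eps < 2 * tau."""
    if not 0 < eps < 2 * tau:
        raise ValueError("need 0 < eps < 2 * tau")
    theta = math.asin(eps / (2 * tau))
    return (math.cos(theta) ** d) * unit_ball_volume(d) * eps ** d
```

For small $\epsilon/\tau$ the cosine factor is close to one, so the bound behaves like the volume of a flat $d$-dimensional ball of radius $\epsilon$.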



Next, consider a collection of balls $\{B_r(p_i)\}_{i=1,\ldots,l}$
centered around points $p_i$ on the manifold and such
that $M \subset \cup_{i=1}^{l} B_r(p_i)$.

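Lemma 7 (Sampling lemma, Lemma 5.1 in [16]).
Let $A_i = B_r(p_i)$, $i = 1, \ldots, l$, be a collection of sets such that $\cup_{i=1}^{l} A_i$
forms a minimal cover of $M$. If $Q(A_i) \ge \alpha$, and

$$n \ge \frac{1}{\alpha}\left(\log l + \log\frac{2}{\delta}\right),$$

then w.p. at least $1 - \delta/2$, each $A_i$ contains at least
one sample point, and $M \subset \cup_{i=1}^{n} B_{2r}(x_i)$. Further we
have that $l \le \frac{\mathrm{vol}(M)}{\cos^d(\theta)\, v_d\, r^d}$, with $\theta$ as in Lemma 6.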
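The sample-size condition of the sampling lemma comes from a union bound: a fixed set of mass at least $\alpha$ is missed by all $n$ samples with probability at most $(1 - \alpha)^n \le e^{-\alpha n}$, and there are $l$ sets. A small Python sketch of this calculation, with illustrative function names and made-up input values:

```python
import math


def sampling_lemma_n(alpha: float, l: int, delta: float) -> int:
    """Smallest integer n with n >= (1 / alpha) * (log l + log(2 / delta))."""
    return math.ceil((math.log(l) + math.log(2 / delta)) / alpha)


def miss_probability_bound(alpha: float, l: int, n: int) -> float:
    """Union bound l * (1 - alpha)^n on the chance that some set
    of mass at least alpha contains no sample point."""
    return l * (1 - alpha) ** n
```

With `alpha = 0.05`, `l = 40` and `delta = 0.1`, the returned sample size drives the union bound below `delta / 2`.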



A.1.1 Proofs for the noiseless case

Lower bound Here we describe the densities on the
two manifolds $M_1$ and $M_2$. There are two sets of interest
to us: $W_1 = M_1 \setminus M_2$, which corresponds to the two
"holes" of radius $4\tau$ in the annulus, and $W_2 = M_2 \setminus M_1$,
which corresponds to the $d$-dimensional piece added to
smoothly join the inner pieces of the two annuli in $M_2$.

By construction, $\mathrm{vol}(W_1) = 2 v_d (4\tau)^d$, where $v_d$ is the
volume of the unit $d$-ball. $\mathrm{vol}(W_2)$ is somewhat tricky
to calculate exactly due to the curvature of $W_2$ but
it is easy to see that $\mathrm{vol}(W_2)$ is also $O(\tau^d)$ with the
constant depending on $d$.

One of the densities is constructed in the following way:
on the set of larger volume (between $W_1$ and $W_2$) we
set $p(x) = a$, and evenly distribute the rest of the mass
over the remaining portion of the manifold (we are
guaranteed that the mass on the rest of the manifold
is at least $a$ since otherwise the constraint $p(x) \ge a$
can never be satisfied).

The other density is constructed to be equal (to the
first density) outside the set on which the two manifolds
differ. The remaining mass is spread evenly on
the set where they do differ. We are again guaranteed
that $p(x) \ge a$ by construction.

Let us now calculate the TV between these two densities.
This is just the integral of the difference of
the densities over the set where one of the densities
is larger. Since the two densities are equal outside
$W_1 \cup W_2$ and have disjoint supports within $W_1 \cup W_2$, it is clear that

$$\mathrm{TV}(p_1, p_2) = a \max(\mathrm{vol}(W_1), \mathrm{vol}(W_2)) \le O(a\tau^d)$$

with the constant depending on $d$. The lower bound
follows from the calculations in the main paper.

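Upper bound The NSW lemma tells us that for

$$n > C_1\left(\log(C_2) + \log\frac{1}{\delta}\right), \quad\text{with}\quad C_1 = \frac{\mathrm{vol}(M)}{a \cos^d(\theta_1)\,\mathrm{vol}(B^d_{\epsilon/4})}, \quad C_2 = \frac{\mathrm{vol}(M)}{a \cos^d(\theta_2)\,\mathrm{vol}(B^d_{\epsilon/8})},$$

$\theta_1 = \sin^{-1}\frac{\epsilon}{8\tau}$ and $\theta_2 = \sin^{-1}\frac{\epsilon}{16\tau}$, we
have $\mathbb{P}(\hat{\mathcal{H}} \neq \mathcal{H}(M)) < \delta$.

By assumption, we have $\mathrm{vol}(M) \le C$. We further
take $\epsilon = \tau/2$. It is clear that in $C_1$ and $C_2$ all terms
except the ball volumes are constant. This gives us
that $C_1 = C_1'/(a\tau^d)$ and $C_2 = C_2'/(a\tau^d)$.

Now, the NSW lemma can be restated as: if
$n = \frac{C_1'}{a\tau^d}\left(\log\frac{C_2'}{a\tau^d} + \log\frac{1}{\delta}\right)$ we recover the homology
with probability at least $1 - \delta$. Notice that this
means that the minimax risk is at most $\delta$.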

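A straightforward rearrangement of this gives us

$$R_n \le \frac{C_2}{a\tau^d}\exp\left(-\frac{n a \tau^d}{C_1}\right)$$

for appropriate constants $C_1, C_2$. To bound the resolution we
rewrite this as

$$R_n \le \exp\left(-\frac{n a \tau^d}{C_1} + \log\frac{C_2}{a\tau^d}\right).$$

One can verify that if

$$\tau^d \ge \frac{C}{n}\log n\,\log(1/\epsilon)$$

for an appropriately large $C$, we have $R_n \le \epsilon$ as desired.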
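The rearrangement between sample size and risk can be checked numerically. The sketch below assumes the bound has the form $R_n \le \frac{C_2}{a\tau^d}\exp(-n a \tau^d / C_1)$ stated above. The constants $C_1$ and $C_2$ are not specified in the text, so the values passed in are placeholders and the function names are illustrative.

```python
import math


def risk_bound(n: float, a: float, tau: float, d: int,
               C1: float, C2: float) -> float:
    """Evaluate (C2 / (a tau^d)) * exp(-n a tau^d / C1)."""
    s = a * tau ** d
    return (C2 / s) * math.exp(-n * s / C1)


def samples_for_risk(delta: float, a: float, tau: float, d: int,
                     C1: float, C2: float) -> float:
    """Sample size n = (C1 / (a tau^d)) * (log(C2 / (a tau^d)) + log(1 / delta))
    at which the bound above equals delta."""
    s = a * tau ** d
    return (C1 / s) * (math.log(C2 / s) + math.log(1 / delta))
```

Plugging the output of `samples_for_risk` back into `risk_bound` returns the target `delta`, which is the rearrangement used in the proof. Halving `tau` multiplies the leading factor $1/\tau^d$ by $2^d$.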

A.1.2 Proofs for the clutter noise case

Lower bound This is a straightforward extension of
the noiseless case. The densities are constructed in an
identical manner. The contribution to the densities
from the clutter noise is identical in each case. As in
the analysis for the noiseless case we bound the total
variation distance between the two densities. We have
an additional factor of $\pi$, which is the mixture weight
of the component corresponding to the density on the
manifold:

$$\mathrm{TV}(q_1, q_2) = \pi a \max(\mathrm{vol}(W_1), \mathrm{vol}(W_2)) \le C_d\, \pi a \tau^d.$$

Given this bound the calculations are identical to those
in the noiseless case.

Upper bound As a preliminary step we will need to
clean the data to eliminate points that are far away
from the manifold. Our analysis will show that Algorithm 1
will achieve this, with high probability. We
will then show that taking a union of balls of the appropriate
radius around the remaining points will give
us the correct homology, with high probability.

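Let $a = \inf_{x \in M} p(x)$, which is strictly positive by assumption.
Define $A = \mathrm{tube}_r(M)$ and $B = \mathbb{R}^D - \mathrm{tube}_{2r}(M)$,
where $r < \frac{(\sqrt{9} - \sqrt{8})\tau}{2}$. Following [TT], we define
$\alpha_s = \inf_{t \in A} Q(B_s(t))$ and $\beta_s = \sup_{t \in B} Q(B_s(t))$,
where $s = 2r$. Then

$$\alpha_s \ge \frac{(1 - \pi)\, v_D s^D}{\mathrm{vol}(\mathrm{Box})} + \pi a v_d r^d \cos^d\theta = \alpha \quad\text{and}\quad \beta_s \le \frac{v_D s^D (1 - \pi)}{\mathrm{vol}(\mathrm{Box})} = \beta,$$

where $\theta = \sin^{-1}\left(\frac{r}{2\tau}\right)$. The
second term of the bound on $\alpha_s$ follows in two steps:
first observe that for any point $x$ in $A$, $B_s(x) \supset B_r(t)$
where $t$ is the closest point on $M$ to $x$. Now, we use
Lemma 6 to bound $Q(B_r(t))$.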

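We will now invoke Algorithm CLEAN on the data with
threshold $t = \frac{\alpha + \beta}{2} = \frac{1}{2}\left(\frac{2(1 - \pi)\, v_D s^D}{\mathrm{vol}(\mathrm{Box})} + \pi a v_d r^d \cos^d\theta\right)$ and radius
$2r$. Let $I$ be the set of vertices returned.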
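Algorithm CLEAN itself is stated in the main paper and is not reproduced in this appendix. As a rough illustration only, the sketch below assumes that CLEAN keeps a point when the fraction of the other sample points falling in the ball of the given radius around it is at least the threshold, which matches the $k$-NN density interpretation mentioned in Remark 2. The function name and the quadratic-time loop are illustrative choices, not the authors' implementation.

```python
import math


def clean(points, threshold, radius):
    """Return the indices of points whose ball of the given radius contains
    at least a `threshold` fraction of the other sample points.

    Assumed behaviour of CLEAN; see the hedged description above.
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    kept = []
    for i, x in enumerate(points):
        # Count the other samples within `radius` of x.
        inside = sum(1 for j, y in enumerate(points)
                     if j != i and math.dist(x, y) <= radius)
        if inside / (n - 1) >= threshold:
            kept.append(i)
    return kept
```

For example, with four tightly clustered points and one distant outlier, a threshold of 0.5 and radius 0.5 keep the cluster and drop the outlier.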



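Define the events $\mathcal{E}_1 = \{\{X_i : i \in I\} \supset \{X_i \in A\}$ and $\{X_i : i \in I^c\} \supset \{X_i \in B\}\}$, and
$\mathcal{E}_2 = \{M \subset \cup_{i \in I} B_{2r}(X_i)\}$. We will show that $\mathcal{E}_1$ and $\mathcal{E}_2$
both hold with high probability.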

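For $\mathcal{E}_1$ to hold, we need $\beta$ to be not too close to $\alpha$,
in particular $\beta \le \alpha/2$ will suffice. This happens with
probability 1, for $\tau$ small if $d < D$. By Lemma 13
in the Appendix, $\mathcal{E}_1$ happens with probability at least
$1 - \delta/2$, provided that $n > 4\kappa \log \kappa$, where

$$\kappa = \max\left\{1 + \frac{200}{3\alpha}\log\frac{2}{\delta},\ 4\right\}.$$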
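As a numerical sanity check (a sketch, not part of the proof), the helpers below evaluate a quantity of the form $\kappa = \max\{1 + \frac{200}{3\alpha}\log(1/p),\ 4\}$ for a failure probability $p$, and confirm that $n = 4\kappa\log\kappa$ satisfies the implicit inequality $n \ge \kappa + \kappa\log n$ that arises when a union bound over $n$ points is inverted. The function names are illustrative.

```python
import math


def kappa(alpha: float, fail_prob: float) -> float:
    """max{1 + (200 / (3 alpha)) * log(1 / fail_prob), 4}."""
    return max(1 + (200 / (3 * alpha)) * math.log(1 / fail_prob), 4.0)


def sample_size(alpha: float, fail_prob: float) -> float:
    """The threshold 4 * kappa * log(kappa) in the condition on n."""
    k = kappa(alpha, fail_prob)
    return 4 * k * math.log(k)


def solves_inequality(n: float, x: float) -> bool:
    """Check the implicit inequality n >= x + x * log(n)."""
    return n >= x + x * math.log(n)
```

For the event $\mathcal{E}_1$ with failure budget $\delta/2$, one would call `kappa(alpha, delta / 2)`. The floor of 4 ensures $\log\kappa > 1$, which is what makes $4\kappa\log\kappa$ large enough.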



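Now we turn to $\mathcal{E}_2$. Let $p_1, \ldots, p_N \in M$ be such that
$B_r(p_1), \ldots, B_r(p_N)$ forms a minimal covering of $M$.
From Lemma 7 we have that $N \le \frac{\mathrm{vol}(M)}{\cos^d(\theta)\, v_d\, r^d}$. Let $A_i = B_r(p_i)$. Then

$$Q(A_i) \ge \frac{(1 - \pi)\, v_D r^D}{\mathrm{vol}(\mathrm{Box})} + \pi a v_d r^d \cos^d(\theta) \ge \pi a v_d r^d \cos^d(\theta) = \gamma.$$

Using again Lemma 7, if $n > \frac{1}{\gamma}\left(\log N + \log\frac{2}{\delta}\right)$, then
with probability at least $1 - \delta/2$, each $A_i$ contains at
least one sample point, and hence $M \subset \cup_{i \in I} B_{2r}(X_i)$,
which implies that $\mathcal{E}_2$ holds.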

Combining these we are now ready to again apply the
main result from NSW. We restate this lemma in a
slightly different form here.

Lemma 8. [NSW] Let $S$ be a set of points in the
tubular neighborhood of radius $R$ around $M$. Let $U = \cup_{x \in S} B_\epsilon(x)$.
If $S$ is $R$-dense in $M$ then $\mathcal{H}(U) = \mathcal{H}(M)$
for all $R < (\sqrt{9} - \sqrt{8})\tau$, if $\epsilon = \frac{R + \tau}{2}$.



Combining the previously established facts with the
lemma above we obtain Lemma 3 from the main paper.
Taking $r = (\sqrt{9} - \sqrt{8})\tau/4$ in that lemma, we can see
that if $n > \frac{C_1}{\tau^d}\left(\log\frac{C_2}{\tau^d} + \log\frac{C_3}{\epsilon}\right)$ then we recover
the correct homology with probability at least $1 - \epsilon$.

This is a sample complexity upper bound. Corre- 
sponding upper bounds on the minimax risk and res- 
olution follow the arguments of the noiseless case. 

A.1.3 Proofs for the tubular noise case

Lower bound In this setting we get samples uniformly
in a full dimensional tube around the manifold.
We are interested in the case when $\sigma \le C_0\tau$ for a small
constant $C_0$.

Let us denote the density $q_1$ at a point in the tube
around $M_1$ by $\theta_1$ and the density $q_2$ around $M_2$ by $\theta_2$.
Since it is not straightforward to decide whether $\theta_1 \le \theta_2$
or not, we will need to consider both possibilities.
We will show the calculations assuming $\theta_1 \le \theta_2$ (the
other calculation follows similarly).

Now, remember from the definition of total variation that
$\mathrm{TV} = q_1(G) - q_2(G)$, where $G$ is the set where $q_1 > q_2$.
We need an upper bound on total variation and so it
suffices to use $\mathrm{TV} \le q_1(G^+) - q_2(G^-)$, where $G^+$ and
$G^-$ are sets containing and contained in $G$ respectively.

Since $\theta_1 \le \theta_2$, $G$ is contained in the holes (of
radius $4\tau$) of the two annuli, and $G$ contains a strip of
width at least $2\tau - 2\sigma$ in these holes. These are $G^+$
and $G^-$.

We need to upper bound the mass under $q_1$ in $G^+$
and lower bound the mass under $q_2$ in $G^-$. We can
now follow a similar argument to the one made
below (in the tubular noise upper bound) to obtain
bounds on the various volumes. In each case, the
volume of the tubular region is $\Omega(\mathrm{vol}(M)\sigma^{D-d})$, and
both $M_1$ and $M_2$ have constant volume, in particular
$c_1 \le \mathrm{vol}(M) \le C_1$, giving us that the tubular region
has volume $\Omega(\sigma^{D-d})$.

It is also clear that both $G^+$ and $G^-$ have volumes
that are $\Omega(\sigma^{D-d}\tau^d)$ (these can be calculated exactly
since they are cylindrical with no additional curvature
but we will not need this here). Here we use that $\sigma$
is not too close to $\tau$ (and in particular is at most a
constant fraction of $\tau$).

Since $q_1$ and $q_2$ are both uniform in their respective
tubes, it follows that

$$\mathrm{TV}(q_1, q_2) \le O(\tau^d).$$

Notice that we assumed $\theta_1 \le \theta_2$ above. The other calculation
is nearly identical and we will not reproduce
it here.

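Upper bound Denote by $M_\sigma$ the tube of radius $\sigma$
around $M$. Recall that we are interested in the case
when $\sigma \ll \tau$, and $\epsilon = \tau/2$.

Lemma 9. If $\sigma \ll \epsilon$ (in particular $\epsilon \ge 2\sigma$ will suffice), then
$k_\epsilon = \inf_{p \in M} Q(B_\epsilon(p)) \ge c\,\epsilon^d / \mathrm{vol}(M)$
for some quantity $c$ independent of $\sigma$.

Proof. For any $p \in M$,

$$Q(B_\epsilon(p)) = \frac{\mathrm{vol}(B_\epsilon(p) \cap M_\sigma)}{\mathrm{vol}(M_\sigma)}.$$

We will prove the claim by deriving an upper
bound on the denominator and a lower bound
on the numerator using packing/covering arguments,
both bounds holding uniformly in $p$.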

Upper bound on $\mathrm{vol}(M_\sigma)$

We consider a covering of $M$ by $\gamma$-balls of $d$ dimensions,
and denote the number of balls required $N_\gamma$,
and the centers $C_\gamma$. It is clear $N_\gamma$ is bounded by the
number of balls of radius $\gamma/2$ one can pack in $M$. A
simple volume argument then gives

$$N_\gamma \le C\,\frac{\mathrm{vol}(M)}{\gamma^d}$$

for some constant $C$. Given this covering of $M$, it is
easy to see that $D$-dimensional balls of radius $\gamma + \sigma$ around each
of the centers in $C_\gamma$ cover the tubular region. Thus,
we have

$$\mathrm{vol}(M_\sigma) \le v_D N_\gamma (\gamma + \sigma)^D \le v_D C\,\frac{\mathrm{vol}(M)}{\gamma^d}(\gamma + \sigma)^D,$$

for any $\gamma$. Selecting $\gamma = \sigma$, we have

$$\mathrm{vol}(M_\sigma) \le C_{D,d}\,\mathrm{vol}(M)\,\sigma^{D-d}$$

for some constant $C_{D,d}$ depending on the manifold and
ambient dimensions, independent of $\sigma$.

Lower bound on $\mathrm{vol}(B_\epsilon(p) \cap M_\sigma)$

Define

$$A(p) = M \cap B_{\epsilon - \sigma}(p), \quad B(p) = M \cap B_\epsilon(p), \quad B_\sigma(p) = M_\sigma \cap B_\epsilon(p).$$

Denote with $N_\sigma$ the number of points we can "pack"
in $A(p)$ such that the distance between any two points
is at least $2\sigma$. Denote the points themselves by the set
$C$. Then,

$$\mathrm{vol}(B_\sigma) \ge N_\sigma v_D \sigma^D,$$

where $v_D$ is the volume of the unit ball in $D$ dimensions.
To see this just note that every point that
is at most $\sigma$ away from any point in $C$ is contained in
$B_\sigma$, and these sets are disjoint so the union of $\sigma$-balls
around $C$ is contained in $B_\sigma$.

Now, to prove a lower bound on $N_\sigma$ we invoke some
ideas from [16]. Consider the map $f$ described in
Lemma 5.3 in [16], which projects the manifold onto
its tangent space, and observe its action on $A(p)$. It
is clear by their discussion that this map projects the
manifold onto a superset of a ball of radius $(\epsilon - \sigma)\cos\theta$,
for $\theta = \sin^{-1}\left(\frac{\epsilon - \sigma}{2\tau}\right)$. In addition to being invertible,
this map is a projection, and only shrinks distances
between points. So if we can derive a lower bound on
the number of points we can "pack" in this projection
then it is also a lower bound on $N_\sigma$. Now, the set is
just a ball in $d$ dimensions of radius $(\epsilon - \sigma)\cos\theta$. Using
the fact that $2\sigma$-balls around each of the points in
$C$ must cover this set, a simple volume argument shows

$$N_\sigma v_d (2\sigma)^d \ge v_d\left((\epsilon - \sigma)\cos\theta\right)^d,$$

i.e.

$$N_\sigma \ge C_{D,d}\left(\frac{(\epsilon - \sigma)\cos\theta}{\sigma}\right)^d,$$

which gives a lower bound.

Putting the upper and lower bound together, we get

$$k_\epsilon = \inf_{p \in M} Q(B_\epsilon(p)) \ge C_{D,d}\,\frac{\left[(\epsilon - \sigma)\cos\theta\right]^d}{\mathrm{vol}(M)} \ge c\,\frac{\epsilon^d}{\mathrm{vol}(M)},$$

for some quantity $c$, independent of $\sigma$. □



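We will prove the following main lemma.
Lemma 10. Let $N_\epsilon$ be the $\epsilon$-covering number of the
submanifold $M$. Let $U = \cup_{i=1}^{n} B_{\epsilon + \tau/2}(X_i)$ and $\hat{\mathcal{H}} = \mathcal{H}(U)$.
Then if $n > \frac{1}{k_\epsilon}\left(\log(N_\epsilon) + \log(1/\delta)\right)$,
$\mathbb{P}(\hat{\mathcal{H}} \neq \mathcal{H}(M)) < \delta$ as long as $\sigma \le \epsilon/2$ and $\epsilon < \frac{(\sqrt{9} - \sqrt{8})\tau}{2}$.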

Proof. This is a straightforward consequence of 
Lemma [8] and Lemma □ 

A.1.4 Proof of Theorem 4 (additive case)

Lower Bound

From Lemma 14 we see that convolution only decreases
the total variation distance, and so the lower bound for
the noiseless case is still valid here.

Upper Bound

We will again proceed by a similar argument to the
clutter noise case. Let $\sqrt{D}\sigma \le r$, $R = 8r$ and
$s = 4r$ and set $\alpha_s = \inf_{p \in A} Q(B_s(p))$ and $\beta_s = \sup_{p \in B} Q(B_s(p))$,
where $A = \mathrm{tube}_r(M)$, $B = \mathbb{R}^D - \mathrm{tube}_R(M)$.

As in the clutter noise case, we will need the two events
$\mathcal{E}_1$ and $\mathcal{E}_2$ to hold with high probability.

We will use the following version of a common inequality,
established by [17].

Lemma 11. For a $D$-dimensional Gaussian random
vector $\epsilon \sim N(0, \sigma^2 I)$,

$$\mathbb{P}(\|\epsilon\| > \sqrt{T}) \le \left(z e^{1 - z}\right)^{D/2},$$

where $z = \frac{T}{D\sigma^2}$.
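The bound of Lemma 11 is easy to evaluate directly. The sketch below (illustrative function name) computes $(z e^{1 - z})^{D/2}$. Under the assumption $\sqrt{D}\sigma \le r$, taking $T = (4r)^2$ gives $z \ge 16$ and taking $T = (2r)^2$ gives $z \ge 4$. Since $z e^{1 - z}$ is decreasing for $z > 1$, plugging in $z = 16$ and $z = 4$ yields valid upper bounds.

```python
import math


def gaussian_norm_tail_bound(z: float, D: int) -> float:
    """Evaluate (z * e^(1 - z))^(D / 2), the tail bound for the norm of a
    D-dimensional Gaussian, where z = T / (D * sigma^2) must exceed 1."""
    if z <= 1:
        raise ValueError("the bound is only informative for z > 1")
    return (z * math.exp(1 - z)) ** (D / 2)
```

For instance, `gaussian_norm_tail_bound(16, D)` equals $(16 e^{-15})^{D/2}$ and `gaussian_norm_tail_bound(4, D)` equals $(4 e^{-3})^{D/2}$. Both shrink geometrically as the ambient dimension $D$ grows.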



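Using this inequality,

$$\mathbb{P}(\|\epsilon\| > 4r) \le \left(16\exp\{-15\}\right)^{D/2} = t$$

and

$$\mathbb{P}(\|\epsilon\| > 2r) \le \left(4\exp\{-3\}\right)^{D/2} = \gamma.$$

Observe that these are both constants. Next, it is easy
to see that

$$\alpha_s \ge Q(B_{s-r}(p)) \ge a v_d r^d (\cos\theta)^d (1 - \gamma) = \alpha,$$

where $\theta = \sin^{-1}(r/(2\tau))$, and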

A <t'D(8r)^i = /3. 

As in the clutter noise case, we need $\beta$ to be sufficiently
smaller than $\alpha$ if we are to successfully clean the data.
As we are interested in the case when $r$ is small, if
$D > d$ then we can take $\beta \le \alpha/2$, while if $D = d$ then
we will need that the dimension is quite large (observe
that both $\gamma$ and $t$ tend to zero rapidly as $D$
grows).

We are now in a position to invoke Lemma 13
to ensure $\mathcal{E}_1$ holds with high probability for $n$ large
enough. Further, one can see that the mass of an $r/2$-ball
close to the manifold is at least

$$Q(A_i) \ge a v_d (1 - \gamma)(\cos\theta)^d (r/2)^d$$

for $\theta = \sin^{-1}(r/(4\tau))$. This quantity is also $O(r^d)$ as
desired, and for $n$ large enough we can ensure $\mathcal{E}_2$ holds
with high probability. Under the condition on $\sigma$, and



we have r < 



{V9-V8) 



-. At this point we can invoke 
Theorem 5.1 from [TT] to see that for n x* we 
recover the correct homology with high probability. 



A.1.5 Deconvolution

Upper bound Recall that the kernel $\Psi$ satisfies

$$\Psi\{x : |x| \ge \epsilon\} \le \gamma \qquad (2)$$

with $\epsilon$ and $\gamma$ being small constants that we will specify
in our proof.

The starting point of our proof will be a uniform concentration
result from Koltchinskii [15].

Lemma 12. Consider the event

$$A = \left\{\max_x \left|\hat{P}_n(B_{2\epsilon}(x)) - P_\Psi(B_{2\epsilon}(x))\right| \le \gamma\right\}.$$

For any small constants $\epsilon$ and $\gamma$, there exists $q \in (0, 1)$
such that

$$\mathbb{P}(A^c) \le 4q^n.$$

This lemma tells us that the deconvolved measure is
uniformly close to a smoothed (by the kernel $\Psi$) version
of the true density.

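Our first step will be to draw

$$m \ge \frac{1}{\omega}\left(\log l + \log\frac{2}{\delta}\right)$$

samples from $\hat{P}_n$, where $\omega = \inf_{x \in M} \hat{P}_n(B_{2\epsilon}(x))$,
$l$ is the $2\epsilon$ covering number of the manifold, and
$\delta = 8q^n$. Denote this sample $Z$. We know that

$$l \le \frac{\mathrm{vol}(M)}{\cos^d(\theta)\, v_d\, (2\epsilon)^d}.$$

Let us first show that we can choose $\epsilon$ and $\gamma$ so that $\omega$
is at least a small positive constant.

$$\omega = \inf_{x \in M} \hat{P}_n(B_{2\epsilon}(x)) \ge \inf_{x \in M} P_\Psi(B_{2\epsilon}(x)) - \gamma.$$

Notice that

$$P_\Psi(B_{2\epsilon}) \ge P(B_\epsilon)\,\Psi\{x : |x| \le \epsilon\}.$$

So, we have

$$\omega \ge \inf_{x \in M} P(B_\epsilon(x))(1 - \gamma) - \gamma.$$

Using the ball volume lemma we have

$$\omega \ge a v_d \epsilon^d \cos^d\theta\,(1 - \gamma) - \gamma,$$

where $\theta = \sin^{-1}(\epsilon/2\tau)$. Notice that $\tau$ is a fixed constant,
and $\epsilon$ and $\gamma$ are constants to be chosen appropriately.
It is clear that for $\gamma \le C_{d,\tau}$, with $C_{d,\tau}$ small,
we have

$$\omega \ge c$$

for a small constant $c$ which depends on $\tau$, $d$ and our
choices of $\epsilon$ and $\gamma$.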

We now use the sampling lemma (Lemma 7) to conclude that
w.p. at least $1 - 4q^n$,

1. The $m$ samples are $4\epsilon$ dense around $M$.

2. $M \subset \cup_{i=1}^{m} B_{4\epsilon}(x_i)$

Our next step will be a cleaning step. This cleaning
procedure differs from the Algorithm CLEAN in that
we use the deconvolved measure to clean the data. In
particular, we will remove all points from $Z$ for which
$\hat{P}_n(B_{4\epsilon}(Z_i)) < 2\gamma$. Denote the remaining points by
$W$. Our estimator will then be constructed from
a union of balls centered at the points of $W$.

To analyze this cleaning procedure, we use the uniform
concentration lemma (Lemma 12) above, and consider the case
when event $A$ happens.

1. All points far away from $M$ are eliminated:
In particular, for any point $x$ if we have

$$\mathrm{dist}(B_{4\epsilon}(x), M) > \epsilon$$

then the corresponding point is eliminated.

To see this is simple. We eliminated all points
with deconvolved empirical mass $\hat{P}_n(B_{4\epsilon}) < 2\gamma$.
Since we are assuming event $A$ happened, we
have for any remaining point $P_\Psi(B_{4\epsilon}) \ge \gamma$. Now,
we have that

$$\Psi\{x : |x| \ge \epsilon\} \le \gamma.$$

From this we see that some part of $B_{4\epsilon}$ must be
within $\epsilon$ of $M$, and we have arrived at a contradiction.

2. All points close to $M$ are kept: In particular,
for any point $x$ if

$$\mathrm{dist}(x, M) \le 2\epsilon$$

then the corresponding point is kept.

We need to show $\hat{P}_n(B_{4\epsilon}(x)) \ge 2\gamma$. Notice that
$\hat{P}_n(B_{4\epsilon}(x)) \ge \hat{P}_n(B_{2\epsilon}(\pi(x)))$, where $\pi(x)$ is the
projection of $x$ onto $M$. This quantity is at least $\omega$.

To finish, we need to show that we can choose
$\epsilon$ and $\gamma$ such that $\omega \ge 2\gamma$. Since
$\omega \ge a v_d \epsilon^d \cos^d\theta\,(1 - \gamma) - \gamma$, which as a function of $\gamma$
is continuous, bounded from below by a constant
depending on $\tau$, $d$ and $\epsilon$, and monotonically increasing
as $\gamma$ decreases, we have for $\gamma$ small enough

$$\omega \ge 2\gamma.$$

3. The set $H$ has the right homology: We have
shown that the cleaning eliminates all points outside
a tube of radius $5\epsilon$, and further keeps all
points in a tube of radius $2\epsilon$. From the sampling
result we know the points that we keep are $4\epsilon$
dense and that $M \subset \cup_{i=1}^{m} B_{4\epsilon}(x_i)$. We can now
apply Lemma 8 to conclude that $H$ has the right
homology provided

$$\epsilon < \frac{(\sqrt{9} - \sqrt{8})\tau}{5}.$$

Since $\tau$ is a fixed constant we can always choose
$\epsilon$ small enough to satisfy this condition. To review,
we need to select $\gamma$ and $\epsilon$ to satisfy three
conditions:

(a) $\omega \ge a v_d \epsilon^d \cos^d\theta\,(1 - \gamma) - \gamma$ has to be at least
a small positive constant.

(b) $\omega \ge 2\gamma$

(c) $\epsilon < \frac{(\sqrt{9} - \sqrt{8})\tau}{5}$

Each of these can be satisfied by choosing $\gamma$ and
$\epsilon$ small enough.
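The requirement that the lower bound on $\omega$ exceed $2\gamma$ can be made explicit. Writing $V = a v_d \epsilon^d \cos^d\theta$, the inequality $V(1 - \gamma) - \gamma \ge 2\gamma$ rearranges to $\gamma \le V/(V + 3)$. The Python sketch below computes this threshold. The function names are illustrative and the numerical inputs in any call are made up, since the paper does not fix values for $a$, $\epsilon$ or $\tau$.

```python
import math


def mass_scale(a: float, eps: float, tau: float, d: int) -> float:
    """V = a * v_d * eps^d * cos^d(theta), with theta = arcsin(eps / (2 tau))."""
    v_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    theta = math.asin(eps / (2 * tau))
    return a * v_d * eps ** d * math.cos(theta) ** d


def omega_lower_bound(a: float, eps: float, tau: float, d: int,
                      gamma: float) -> float:
    """The lower bound V * (1 - gamma) - gamma on omega."""
    return mass_scale(a, eps, tau, d) * (1 - gamma) - gamma


def max_gamma(a: float, eps: float, tau: float, d: int) -> float:
    """Largest gamma with V * (1 - gamma) - gamma >= 2 * gamma, i.e. V / (V + 3)."""
    V = mass_scale(a, eps, tau, d)
    return V / (V + 3)
```

Any `gamma` at or below `max_gamma(...)` therefore makes the lower bound on $\omega$ at least $2\gamma$, and in particular positive.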

Now, returning to $m$. We have

$$m \ge \frac{1}{\omega}\left(\log l + \log\frac{2}{\delta}\right),$$

where $\omega = \inf_{x \in M} \hat{P}_n(B_{2\epsilon}(x))$, $l$ is the $2\epsilon$ covering
number of the manifold, $l \le \frac{\mathrm{vol}(M)}{\cos^d(\theta)\, v_d\, (2\epsilon)^d}$,
and $\delta = 8q^n$. It is clear that all terms except
those in $n$ are constant. In particular it is easy to
see that

$$m \ge Cn$$

for $C$ large enough is sufficient.

From this we can conclude that with probability at least
$1 - 8q^n$ our procedure will construct an estimator with
the correct homology. Since $q \in (0, 1)$ the success
probability can be re-written as at least $1 - e^{-cn}$ for $c$
small enough. Together this gives us the deconvolution
lemma from the main paper.

A.2 Additional technical lemmas
A.2.1 The cleaning lemma

In this section we sharpen Lemma 4.1 of [17], also
known as the A-B lemma, by using Bernstein's inequality
instead of Hoeffding's inequality. This modification
is crucial to obtain minimax rates.

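Lemma 13. Let $\beta_s \le \beta \le \alpha/2 \le \alpha_s/2$. If $n > 4\kappa \log \kappa$, where

$$\kappa = \max\left\{1 + \frac{200}{3\alpha}\log\frac{1}{\delta},\ 4\right\},$$

then procedure CLEAN($\frac{\alpha + \beta}{2}$) will remove all points in
region $B$ and keep all points in region $A$ with probability
at least $1 - \delta$.

Proof. We use the notation established in Section 5.2.
We first analyze the set $A$.

For a point $X_i$ in $A$, let $q = q(i) = Q(B_s(X_i))$, and
define

$$Z_j = I(X_j \in B_s(X_i)), \quad j \ne i,$$

where $I$ denotes the indicator function. Notice that
the random variables $\{Z_j, j \ne i\}$ are independent
Bernoulli with common mean $q$.

We will consider two cases.

Case 1: $\alpha \le q \le 2\alpha$.
Notice that if

$$\frac{1}{n-1}\sum_{j \ne i} Z_j - q \ge -\frac{\alpha}{4}$$

the point $X_i$ will not be removed. By Bernstein's inequality,
the probability that $X_i$ will instead be removed is

$$\le \exp\left\{-\frac{1}{2}\,\frac{(n-1)(\alpha/4)^2}{2\alpha + \alpha/12}\right\} \le \exp\left\{-\frac{3}{200}(n-1)\alpha\right\}.$$

Case 2: $q > 2\alpha$.
In this case if

$$\frac{1}{n-1}\sum_{j \ne i} Z_j - q \ge -\left(q - \frac{3\alpha}{4}\right)$$

the point $X_i$ will not be removed. By Bernstein's inequality,
the probability that $X_i$ is removed is

$$\le \exp\left\{-\frac{1}{2}\,\frac{(n-1)(q - 3\alpha/4)^2}{q + (q - 3\alpha/4)/3}\right\} \le \exp\left\{-\frac{3}{200}(n-1)\alpha\right\}.$$

Now, consider a point $X_i$ in the region $B$, and define
$q$ and the $Z_j$s in an identical way. This time if

$$\frac{1}{n-1}\sum_{j \ne i} Z_j - q < \frac{\alpha}{4},$$

the point $X_i$ will be removed. Another application of
Bernstein's inequality yields that the probability that $X_i$ is kept is

$$\le \exp\left\{-\frac{1}{2}\,\frac{(n-1)(\alpha/4)^2}{\alpha/2 + \alpha/12}\right\} \le \exp\left\{-\frac{3}{56}(n-1)\alpha\right\}.$$

Putting all the pieces together, we obtain that the
cleaning procedure fails on some point with probability
at most $n \exp\left\{-\frac{3}{200}(n-1)\alpha\right\}$. For this to be at most $\delta$ requires

$$n - 1 \ge \frac{200}{3\alpha}\left(\log n + \log\frac{1}{\delta}\right),$$

i.e.

$$n \ge 1 + \frac{200}{3\alpha}\log\frac{1}{\delta} + \frac{200}{3\alpha}\log n.$$

If $\delta < 1/2$, then $1 + \frac{200}{3\alpha}\log\frac{1}{\delta}$ is at least a constant multiple of $\frac{200}{3\alpha}$,
so up to constants it is enough to solve

$$n \ge x + x\log n$$

with $x = 1 + \frac{200}{3\alpha}\log\frac{1}{\delta}$. The result of the lemma
follows. □

A.2.2 Convolution only decreases total variation

Lemma 14. Let $P$ and $Q$ be two probability measures in
$\mathbb{R}^D$ with common dominating measure $\mu$. Then,

$$\mathrm{TV}(P \star \Phi, Q \star \Phi) \le \mathrm{TV}(P, Q),$$

where $\star$ denotes convolution and $\Phi$ is a probability
measure on $\mathbb{R}^D$.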



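Proof. This is a standard result, but we provide a
proof for completeness. Let $p \star \phi$ denote the Lebesgue
density of the probability distribution $P \star \Phi$, i.e.

$$p \star \phi(z) = \int \phi(z - x)\, p(x)\, d\mu(x), \quad z \in \mathbb{R}^D.$$

Similarly, $q \star \phi$ denotes the analogous quantity for $Q \star \Phi$.
Then,

$$\begin{aligned}
2\,\mathrm{TV}(P \star \Phi, Q \star \Phi) &= \int_{\mathbb{R}^D} \left|p \star \phi(z) - q \star \phi(z)\right| dz \\
&= \int_{\mathbb{R}^D} \left|\int \phi(z - x)\, p(x)\, d\mu(x) - \int \phi(z - x)\, q(x)\, d\mu(x)\right| dz \\
&= \int_{\mathbb{R}^D} \left|\int \phi(z - x)\left(p(x) - q(x)\right) d\mu(x)\right| dz \\
&\le \int_{\mathbb{R}^D} \int \left|\phi(z - x)\left(p(x) - q(x)\right)\right| d\mu(x)\, dz \\
&\le \int \left(\int_{\mathbb{R}^D} \phi(z - x)\, dz\right) \left|p(x) - q(x)\right| d\mu(x) \\
&= \int \left|p(x) - q(x)\right| d\mu(x) \\
&= 2\,\mathrm{TV}(P, Q).
\end{aligned}$$

□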
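The statement of Lemma 14 can be illustrated on a toy example. The sketch below works with finitely supported distributions on the integers rather than the Lebesgue densities used in the proof, so it is a discrete analogue under that simplifying assumption and not the lemma itself. The function names and the example distributions are made up for illustration.

```python
def total_variation(p, q):
    """Total variation distance between two distributions given as
    dictionaries mapping support points to probabilities."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def convolve(p, phi):
    """Distribution of X + E for independent X ~ p and E ~ phi."""
    out = {}
    for x, px in p.items():
        for e, pe in phi.items():
            out[x + e] = out.get(x + e, 0.0) + px * pe
    return out


# Two distributions on {0, 1} and a symmetric noise distribution on {-1, 0, 1}.
p = {0: 0.5, 1: 0.5}
q = {0: 0.2, 1: 0.8}
phi = {-1: 0.25, 0: 0.5, 1: 0.25}

tv_before = total_variation(p, q)
tv_after = total_variation(convolve(p, phi), convolve(q, phi))
```

Here adding the same noise to both distributions blurs the difference between them, so `tv_after` does not exceed `tv_before`.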



